{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Renumber\n",
    "\n",
    "In this notebook, we will use the _renumber_ function to compute new vertex IDs.\n",
    "\n",
    "Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row format (see https://en.wikipedia.org/wiki/Sparse_matrix).  The problem with a matrix representation is that there is a column and row for every possible vertex ID.  Therefore, if the data contains vertex IDs that are non-contiguious, or which start at a large initial value, then there is a lot of empty space that uses up memory.      \n",
    "\n",
    "An alternative case is using renumbering to convert from one data type down to a contiguious sequence of integer IDs.  This is useful when the dataset contain vertex IDs that are not integers.  \n",
    "\n",
    "\n",
    "Notebook Credits\n",
    "* Original Authors: Bradley Rees\n",
    "* Created:   08/13/2019\n",
    "* Updated:   03/03/2020\n",
    "\n",
    "RAPIDS Versions: 0.13    \n",
    "\n",
    "Test Hardware\n",
    "\n",
    "* GV100 32G, CUDA 10.2\n",
    "\n",
    "\n",
    "## Introduction\n",
    "The renumber function takes an edge list (source, destination) and renumbers the vertices so that the index start at 0 and are contiguious.  The function also converts the data type to return int32\n",
    "\n",
    "To renumber an edge list (COO data) use:<br>\n",
    "\n",
    "**cugraph.renumber(source, destination)**\n",
    "* __source__: cudf.Series\n",
    "* __destination__: cudf.Series\n",
    "\n",
    "\n",
    "Returns:\n",
    "* __triplet__: three variables are returned:\n",
    "    * 'src': the new source vertex IDs\n",
    "    * 'dst': the new destination IDs\n",
    "    * 'mapping': a mapping of new IDs to original IDs.  Since the new IDs are sequencial from 0, the index value represents the new vertex ID\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test Data\n",
    "A cyber data set from the University of New South Wales is used, where just the IP edge pairs from been extracted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import needed libraries\n",
    "import cugraph\n",
    "import cudf\n",
    "import nvstrings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the data\n",
    "# the file contains an index column that will be ignored\n",
    "\n",
    "datafile='../data/cyber.csv'\n",
    "\n",
    "gdf = cudf.read_csv(datafile, delimiter=',', names=['idx','srcip','dstip'], dtype=['int32','str', 'str'], skiprows=1, usecols=['srcip', 'dstip'] )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Look at the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>srcip</th>\n",
       "      <th>dstip</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>﻿59.166.0.0</td>\n",
       "      <td>149.171.126.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>59.166.0.0</td>\n",
       "      <td>149.171.126.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>59.166.0.6</td>\n",
       "      <td>149.171.126.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>59.166.0.5</td>\n",
       "      <td>149.171.126.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>59.166.0.3</td>\n",
       "      <td>149.171.126.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         srcip          dstip\n",
       "0  ﻿59.166.0.0  149.171.126.6\n",
       "1   59.166.0.0  149.171.126.9\n",
       "2   59.166.0.6  149.171.126.7\n",
       "3   59.166.0.5  149.171.126.5\n",
       "4   59.166.0.3  149.171.126.0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# take a peek at the data\n",
    "gdf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Since IP columns are strings, we first need to convert them to integers\n",
    "gdf['src_ip'] = gdf['srcip'].str.ip2int()\n",
    "gdf['dst_ip'] = gdf['dstip'].str.ip2int()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "edges: 2546575\n",
      "max: 3758096389 min: 0 range: 3758096390\n"
     ]
    }
   ],
   "source": [
    "# look at that data and the range of values\n",
    "maxT = max(gdf['src_ip'].max(), gdf['dst_ip'].max())\n",
    "minT = min(gdf['src_ip'].min(), gdf['dst_ip'].min())\n",
    "\n",
    "r = maxT - minT +1\n",
    "print(\"edges: \" + str(len(gdf)))\n",
    "print(\"max: \" + str(maxT) + \" min: \" + str(minT) + \" range: \" + str(r))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data has 2.5 million edges that span a range of 3,758,096,389 \n",
    "Even if every vertex ID was unique per edge, that would only be 5 million values versus the 3.7 billion that is currently there.  \n",
    "In the curret state, the produced matrix would 3.7 billion by 3.7 billion - that is a lot of wasted space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Time to Renumber\n",
    "One good best practice is to have the returned edge pairs appended to the original dataframe. That will help merge results back into the datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf['src_r'], gdf['dst_r'], numbering = cugraph.renumber(gdf['src_ip'], gdf['dst_ip'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>srcip</th>\n",
       "      <th>dstip</th>\n",
       "      <th>src_ip</th>\n",
       "      <th>dst_ip</th>\n",
       "      <th>src_r</th>\n",
       "      <th>dst_r</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>﻿59.166.0.0</td>\n",
       "      <td>149.171.126.6</td>\n",
       "      <td>1000734720</td>\n",
       "      <td>2511044102</td>\n",
       "      <td>40</td>\n",
       "      <td>21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>59.166.0.0</td>\n",
       "      <td>149.171.126.9</td>\n",
       "      <td>1000734720</td>\n",
       "      <td>2511044105</td>\n",
       "      <td>40</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>59.166.0.6</td>\n",
       "      <td>149.171.126.7</td>\n",
       "      <td>1000734726</td>\n",
       "      <td>2511044103</td>\n",
       "      <td>46</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>59.166.0.5</td>\n",
       "      <td>149.171.126.5</td>\n",
       "      <td>1000734725</td>\n",
       "      <td>2511044101</td>\n",
       "      <td>45</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>59.166.0.3</td>\n",
       "      <td>149.171.126.0</td>\n",
       "      <td>1000734723</td>\n",
       "      <td>2511044096</td>\n",
       "      <td>43</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         srcip          dstip      src_ip      dst_ip  src_r  dst_r\n",
       "0  ﻿59.166.0.0  149.171.126.6  1000734720  2511044102     40     21\n",
       "1   59.166.0.0  149.171.126.9  1000734720  2511044105     40     24\n",
       "2   59.166.0.6  149.171.126.7  1000734726  2511044103     46     22\n",
       "3   59.166.0.5  149.171.126.5  1000734725  2511044101     45     20\n",
       "4   59.166.0.3  149.171.126.0  1000734723  2511044096     43     15"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now look at the renumbered ranged of values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "edges: 2546575\n",
      "max: 49 min: 0 range: 50\n"
     ]
    }
   ],
   "source": [
    "# look at that data and the range of values\n",
    "maxT = max(gdf['src_r'].max(), gdf['dst_r'].max())\n",
    "minT = min(gdf['src_r'].min(), gdf['dst_r'].min())\n",
    "\n",
    "r = maxT - minT + 1\n",
    "print(\"edges: \" + str(len(gdf)))\n",
    "print(\"max: \" + str(maxT) + \" min: \" + str(minT) + \" range: \" + str(r))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just saved 3.7 billion unneeded spaces in the matrix!<br>\n",
    "And we can now see that there are only 50 unique IP addresses in the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Copyright (c) 2019, NVIDIA CORPORATION.\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n",
    "___"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cugraph_dev",
   "language": "python",
   "name": "cugraph_dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
